1. Introduction

In Project 1, we examined the Airbnb dataset and conducted basic exploratory data analysis (EDA) with statistical inference to draw baseline relationships between different variables. While exercises in correlation using t-tests, chi-squared tests, and ANOVA tests yielded interesting results, we wanted to dive deeper into the stories behind the numbers. For that, we turn to regressions in order to investigate causal relationships between variables, and we introduce a dataset on crime in order to better extrapolate results to the broader population of Airbnbs. This paper first provides a high-level EDA of the Airbnb and crime datasets and then applies a variety of regression techniques. We begin with Airbnb variables only, examining what factors affect listing prices. Then, we overlay the crime data onto the Airbnb data to investigate the effects of crime on prices. The penultimate section compares the regression results against one another in order to determine the best model(s). Lastly, the paper concludes with a summary of findings and areas for further research.

2. Data Sources

In this section, we provide a summary of the data sources used in this analysis.

2.1 Airbnb Data

The Airbnb data for this project is from InsideAirbnb.com. This website contains information sourced by Murray Cox, who utilized Airbnb’s public application programming interface (API) to mine this data. Originally, Cox scraped this data to identify illegal listings in New York City. He has since expanded his dataset offerings to cities across the world and makes the data available for open-source use and research.

Although Cox is potentially a biased source due to his activist leanings, his datasets originate from Airbnb itself and are thoroughly documented. We also considered using data from Airbnb directly; however, other studies have shown that this data is outdated and biased in that it only shows the positive side of Airbnb. Cox, by contrast, uses the company’s own data, which he scraped from the website through a well-documented procedure, to explore how Airbnb is really affecting our community. We therefore judged the dataset to be reliable because of the author’s documentation and his purpose for releasing it.

2.2 Crime Data

The crime dataset is sourced from OpenData DC. The dataset contains a subset of locations and attributes of incidents reported in the ASAP (Analytical Services Application) crime report database by the District of Columbia Metropolitan Police Department (MPD).

This data is shared via an automated process where addresses are geocoded to the District’s Master Address Repository and assigned to the appropriate street block. Block locations for some crime points could not be automatically assigned, resulting in (0,0) for their (x,y) coordinates.

3. Exploratory Data Analysis

To begin, we present various summary statistics for the two datasets (Airbnb and crime) that we are investigating. Below are the structure printouts for both datasets, beginning with Airbnb, followed by crime:

In the Airbnb (“listings”) dataset, there are 9,126 observations and 17 variables. The Airbnb dataset is relatively comprehensive, consisting of various qualitative variables including unique ID, name, and neighbourhood. Additionally, there are a number of quantitative variables such as price and number of reviews. More importantly, the listings dataset contains latitude and longitude coordinates that will serve as the link to the crime dataset.

The crime dataset, by comparison, contains 29,045 observations and 26 variables. It records 9 types of offenses as well as the method of crime (gun, knife, others) and time of day (day, evening, midnight). Like the listings dataset, the crime dataset is labeled by latitude and longitude coordinates, and also by census tract; these will be important variables for joining the two datasets together.

3.1 Summary Statistics

An important step to EDA is exploring the data through summary statistics. Since we have already examined the listings dataset thoroughly in Project 1, the following section will primarily focus on the crime dataset. Since we are interested in how crime levels affect Airbnb prices, it is important to take a closer look at the different types of crime. Below is a summary table of the number of crimes by offense and ward:

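WARD | ARSON | ASSAULT W/DANGEROUS WEAPON | BURGLARY | HOMICIDE | MOTOR VEHICLE THEFT | ROBBERY | SEX ABUSE | THEFT F/AUTO | THEFT/OTHER | TOTAL
-----|-------|----------------------------|----------|----------|---------------------|---------|-----------|--------------|-------------|------
   1 |     1 |                        151 |      117 |       15 |                 221 |     345 |        17 |         1443 |        1806 |  4116
   2 |     1 |                        119 |      168 |        0 |                 225 |     232 |        26 |         1898 |        3467 |  6136
   3 |     0 |                         21 |       72 |        3 |                  89 |      35 |        10 |          566 |         898 |  1694
   4 |     0 |                         60 |      110 |        4 |                 176 |     136 |        13 |          857 |         863 |  2219
   5 |     1 |                        241 |      208 |       17 |                 347 |     312 |        22 |         1441 |        1729 |  4318
   6 |     2 |                        150 |      119 |       16 |                 248 |     309 |        23 |         1625 |        2411 |  4903
   7 |     0 |                        324 |      137 |       38 |                 421 |     317 |        30 |          756 |        1191 |  3214
   8 |     3 |                        320 |      175 |       54 |                 249 |     257 |        37 |          521 |         829 |  2445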

Ward 2 has the most crimes (6,136), followed by Ward 6 with 4,903. Moreover, “Theft/Other” is the most common offense in every ward, as seen in the table above.
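A ward-by-offense table of this kind can be built with base R. The following is a minimal sketch on a toy data frame; the column names `WARD` and `OFFENSE` are assumptions about the crime file's layout, and the rows are made up for illustration.

```r
# Toy stand-in for the crime data; WARD and OFFENSE are assumed column names.
crime <- data.frame(
  WARD    = c(1, 1, 2, 2, 2, 3),
  OFFENSE = c("ROBBERY", "THEFT/OTHER", "THEFT/OTHER",
              "THEFT/OTHER", "BURGLARY", "ROBBERY")
)

# Count incidents for every ward/offense pair.
counts <- table(crime$WARD, crime$OFFENSE)

# Append a TOTAL column holding each ward's row sum.
by_ward <- cbind(counts, TOTAL = rowSums(counts))
print(by_ward)
```

Because every incident falls in exactly one cell, the TOTAL column should sum to the number of rows in the crime data, which is a quick sanity check on the table.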

3.2 Data Visualization

To better understand the crime data and the underlying relationships, data visualization is a useful tool. This section presents two charts: a bar chart and a pie chart. The bar chart is presented below:

Additionally, the same data can be visualized as a pie chart, which shows each ward’s share of the total number of crimes:

These summary statistics and charts are particularly important in the context of Airbnb listings. It is not unreasonable to hypothesize that wards with a higher number of crimes overall may also exhibit an adverse effect on listing prices. As fewer people want to stay in those areas, demand for Airbnbs decreases and, in turn, so do prices. Further analysis using regression techniques will be needed to determine the overall effect of crime on prices.

4. Regression Models

After conducting EDA and looking at the variables at a high level, we move on to generating regression models to estimate causal relationships between the two datasets. In this section, we examine four models using the techniques learned in class, beginning with a simple linear regression model within the Airbnb dataset alone. Then we overlay the crime dataset onto the listings dataset in order to explore how crime affects Airbnb listing prices. The penultimate model consists of a hedonic regression used to predict price. Lastly, we implement machine learning methodology to break down the primary drivers of price.

4.1 Linear Regression Model

We used linear regression models to explore the relationship between price and the other variables in the Airbnb data set.

To begin this portion of our analysis, we made a simple correlation plot to identify what correlations exist between the variables. The correlation plot shows that no strong correlations exist between the variables, as all of the values are close to zero. The strongest positive correlation is 0.28, between host listings count and availability 365. The strongest negative correlation is -0.13, between price and number of reviews.
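The numbers behind a correlation plot come from a pairwise correlation matrix. As a minimal sketch, the code below computes one with base R on simulated data; the column names follow the listings file, but the values are randomly generated and are not the report's data.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the numeric listing variables.
listings <- data.frame(
  price             = rlnorm(n, meanlog = 4.8, sdlog = 0.5),
  number_of_reviews = rpois(n, 15),
  minimum_nights    = rpois(n, 2) + 1,
  availability_365  = sample(0:365, n, replace = TRUE)
)

# Pairwise Pearson correlations, tolerating missing values.
corr <- cor(listings, use = "pairwise.complete.obs")

# Locate the strongest off-diagonal correlation (by absolute value).
off_diag <- corr
diag(off_diag) <- NA
strongest <- which(abs(off_diag) == max(abs(off_diag), na.rm = TRUE),
                   arr.ind = TRUE)[1, ]

print(round(corr, 2))
print(strongest)
```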

Before building the linear regression models, we removed price outliers using the outlier function from Prof. Lo. After the outliers were removed, the price data took on a more normal-looking distribution.

Outliers identified: 919
Proportion (%) of outliers: 11.2
Mean of the outliers: 882.65
Mean without removing outliers: 202.99
Mean if we remove outliers: 126.88
Outliers successfully removed
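The course-provided outlier function is not reproduced in this report. As a rough stand-in, the sketch below assumes it applies the common 1.5 × IQR rule, which may not match the actual function; `remove_iqr_outliers` is a hypothetical helper and the prices are made up.

```r
# Hypothetical helper: keep rows whose value lies within 1.5 * IQR of the quartiles.
remove_iqr_outliers <- function(df, var) {
  x <- df[[var]]
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  keep <- !is.na(x) & x >= q[1] - fence & x <= q[2] + fence
  df[keep, , drop = FALSE]
}

# Toy prices with one extreme value.
prices <- data.frame(price = c(50, 80, 95, 110, 120, 135, 150, 180, 2500))
clean  <- remove_iqr_outliers(prices, "price")

mean(prices$price)  # mean before removal
mean(clean$price)   # mean after removal
```

As in the printout above, a handful of very expensive listings can pull the mean far above the typical price, which is why the mean drops sharply once they are removed.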

Next, we created a scatterplot comparing price and number of reviews to get a visual representation of the data and to see how it compares to the correlation we saw in the correlation plot. The resulting figure is shown below:

The scatterplot shows no significant linear relationship between price and number of reviews, suggesting that a linear regression may not be the best model for this data. Nonetheless, we soldiered on with the regression. The regression output is shown below:


Calls:
Model 1: lm(formula = price ~ number_of_reviews, data = datafit1)

==================================
  Constant            130.690***  
                       (0.950)    
  Number of Reviews    -0.094***  
                       (0.013)    
----------------------------------
  R-squared             0.007     
  F                    55.740     
  p                     0.000     
  N                  8207         
==================================
  Significance:   
                *** = p < 0.001;   
                ** = p < 0.01;   
                * = p < 0.05  

The estimated coefficient on the number of reviews indicates that each additional review is associated with a 9.4-cent decrease in listing price. This result is statistically significant; however, the R² value shows that only 0.7% of the total variation in price is explained by number of reviews. As such, there are likely other factors that contribute to price. We therefore fit multivariate regressions with five specifications of relevant regressors, compare the results of the models, and determine the linear regression specification that best fits the data. The regression outputs are shown below:


Calls:
Model 1: lm(formula = price ~ number_of_reviews, data = datafit1)
Model 2: lm(formula = price ~ number_of_reviews + calculated_host_listings_count, 
    data = datafit2)
Model 3: lm(formula = price ~ number_of_reviews + calculated_host_listings_count + 
    minimum_nights, data = datafit3)
Model 4: lm(formula = price ~ number_of_reviews + calculated_host_listings_count + 
    minimum_nights + availability_365, data = datafit4)
Model 5: lm(formula = price ~ number_of_reviews + calculated_host_listings_count + 
    minimum_nights + availability_365 + neighbourhoodCount, data = datafit5)

========================================================================================
                         Model 1      Model 2      Model 3      Model 4      Model 5    
----------------------------------------------------------------------------------------
  Constant              130.690***   128.019***   128.785***   123.988***   124.432***  
                         (0.950)      (1.011)      (1.034)      (1.201)      (1.404)    
  Number of Reviews      -0.094***    -0.085***    -0.089***    -0.105***    -0.105***  
                         (0.013)      (0.013)      (0.013)      (0.013)      (0.013)    
  Host Listings Count                  0.326***     0.349***     0.242***     0.243***  
                                      (0.043)      (0.044)      (0.046)      (0.046)    
  Min. Nights                                      -0.132***    -0.168***    -0.168***  
                                                   (0.038)      (0.038)      (0.038)    
  Availability                                                   0.051***     0.050***  
                                                                (0.006)      (0.006)    
  Neighborhood Count                                                         -0.002     
                                                                             (0.003)    
----------------------------------------------------------------------------------------
  R-squared               0.007        0.014        0.015        0.022        0.022     
  F                      55.740       56.672       41.784       46.709       37.439     
  p                       0.000        0.000        0.000        0.000        0.000     
  N                    8207         8207         8207         8207         8207         
========================================================================================
  Significance: *** = p < 0.001; ** = p < 0.01; * = p < 0.05  

Looking at the various models, one trend is apparent: as we include more relevant regressors, the R² value increases. However, R² peaks at 0.022, suggesting that, at best, only 2.2% of the total variation in price is captured by the regressors. Not good!

Analyzing the individual regressors shows that many of them are statistically significant. Generally speaking, the number of reviews and minimum nights negatively impact listing price. In other words, price decreases by roughly 9 to 11 cents per additional review and by roughly 13 to 17 cents per additional minimum night, depending on which model is used. This result is not terribly surprising, considering that more popular Airbnbs (ones with many reviews) may be cheaper. Additionally, Airbnb hosts often offer discounted prices for monthly or even annual rentals (i.e., minimum nights = 30 or 365), so prices are lower in those cases as well.

On the other hand, host listings count and availability positively affect the listing price, by roughly 24 to 35 cents and 5 cents, respectively. These results are also not surprising. A higher host listings count suggests that the owner of the Airbnb is a commercial owner who may have more experience with Airbnb clients and can therefore price closer to the market. Additionally, such owners may offer more amenities in their listings, driving price up as well. Furthermore, availability is seen to increase prices, likely because hosts whose listings are available more often receive more traffic, driving price up.

In addition to understanding and interpreting the regression output, it is also crucial to evaluate the models and choose the specification that fits the data best. Looking at the R² values, Models 4 and 5 explain the most variation, both at 2.2%. However, since neighborhood count is not statistically significant, the best model is Model 4: adding regressors that are not statistically significant increases complexity without improving fit and may put the model at risk of overfitting. In all, number of reviews, host listings count, minimum nights, and availability are included in the regression model for price. However, since the R² values are still low, our team turned to an outside factor, crime, to try to capture more of the variation in price.
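One way to make the choice between nested specifications more formal is to compare adjusted R² values and run a partial F-test. The sketch below illustrates this on simulated data; the variable names mirror the regressors above, but the data are randomly generated and `neighbourhoodCount` is constructed as pure noise, so this is an illustration of the technique and not a re-analysis of the listings.

```r
set.seed(42)
n <- 500

# Simulated regressors; neighbourhoodCount has no effect on price by construction.
d <- data.frame(
  number_of_reviews  = rpois(n, 20),
  availability_365   = sample(0:365, n, replace = TRUE),
  neighbourhoodCount = rpois(n, 200)
)
d$price <- 125 - 0.1 * d$number_of_reviews +
  0.05 * d$availability_365 + rnorm(n, sd = 60)

m_small <- lm(price ~ number_of_reviews + availability_365, data = d)
m_big   <- update(m_small, . ~ . + neighbourhoodCount)

# Plain R-squared never falls when a regressor is added; adjusted R-squared can.
r2     <- c(small = summary(m_small)$r.squared,     big = summary(m_big)$r.squared)
adj_r2 <- c(small = summary(m_small)$adj.r.squared, big = summary(m_big)$adj.r.squared)
print(rbind(r2, adj_r2))

# Partial F-test: does the extra regressor significantly improve the fit?
cmp <- anova(m_small, m_big)
print(cmp)
```

Since plain R² cannot decrease when a regressor is added, a tie such as the 0.022 shared by Models 4 and 5 is better broken with adjusted R² or the F-test than with R² alone.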

4.2 Crime and Listing Regressions

One key area of interest is examining the connections between crime and Airbnb prices. There are two primary hypotheses on how crime affects listing prices: (1) higher crime rates reduce demand and lower listing prices; and (2) crime is targeted in wealthier neighborhoods that have higher listing prices.

The conventional thought process is that higher crime areas will drive potential customers away. As a result, there will be less overall demand for Airbnbs in that neighborhood, leading to a decline in prices. While this mechanism makes sense at face value, deeper thinking about areas where crime exists and is most prevalent may point to a different directionality. One may argue that more affluent, urban neighborhoods may exhibit higher crime rates, especially in terms of home or auto theft. Assuming that wealthier and more accessible neighborhoods have higher Airbnb listing prices, higher crime rates would then be a reaction to higher prices, contrasting with the conventional view.

To test these hypotheses and to determine which one explains the true relationship, we will first merge the crime dataset with the listing dataset and then construct regression models to estimate the effect of crime on listing prices. Prior to any regression analysis, it is always important to take a cursory glance at the summary statistics and simple data visuals in order to get a sense of how the variables relate to one another. Below is a breakdown of the percentage of crimes by zip code:

As shown above, even within the sample of the ten zip codes with the most crime occurrences, the first zip code (20004; 1,145 total crimes) has more than double the number of crimes of the tenth zip code (20002; 504 total crimes). Turning our attention to how prices differ, we take a look at the summary statistics for listing price in these neighborhoods, first in the form of a summary table, then in the form of a boxplot:

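Summary Statistics |     20002 (low) |    20004 (high) |     20009 (low) |    20010 (high)
-------------------|-----------------|-----------------|-----------------|----------------
Price              |                 |                 |                 |
   min             |              29 |              48 |              20 |              30
   max             |            1450 |            1050 |            1000 |            1223
   median          |             115 |             192 |             119 |              99
   mean (sd)       | 171.55 ± 190.82 | 235.28 ± 179.08 | 162.26 ± 144.18 | 140.57 ± 145.03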

Interestingly enough, the summary table shows that the highest-crime zip code (20004) has by far the highest average listing price of the four, although the pattern is not uniform: 20010, also labeled high-crime, has the lowest mean. This result suggests that, perhaps contrary to conventional wisdom, crime may indeed be targeted at more affluent neighborhoods, if we assume that higher listing prices are correlated with wealthier neighborhoods (that may be the subject of a future study). Of course, this table and the subsequent boxplot capture only a very small sample, too small to support any substantive claims about crime and listing price. In fact, if we made any such claims, we would be falling into the classic trap of treating correlation as causation, which we have learned to be false. As such, we now turn our attention to the most important topic at hand: regressions.

We begin first by building a simple regression model to estimate the effect of crime on listing price. After assessing this preliminary model using various evaluation techniques, we aim to tune the model, either through linearization or adding additional relevant regressors to the model. Below are the regression results from the simple linear regression of price vs. total crime:


Calls:
Model 1: lm(formula = price ~ total_crimes, data = lm1_input)

==========================
  Constant    237.696***  
              (11.740)    
  Crimes       -0.024     
               (0.020)    
--------------------------
  R-squared     0.000     
  F             1.432     
  p             0.232     
  N          4690         
==========================
  Significance:   
                *** = p < 0.001;   
                ** = p < 0.01;   
                * = p < 0.05  

The OLS estimate of the effect of total crimes on listing price is -0.024, suggesting that each additional crime is associated with a 2.4-cent decrease in listing price. However, this coefficient is not statistically different from 0. Additionally, the model yields an R² of 0.000, indicating that essentially none of the variation in price is captured by this specification. Moreover, the p-value of the F-test (0.232) is well above the alpha level of 0.05, so the overall model is not significant.
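The merge-then-regress workflow behind this output can be sketched in base R as follows. The report does not state which key was used to join the two datasets, so `zipcode` is an assumption here, and the rows are toy values chosen only so the code runs.

```r
# Toy listings and crime records; `zipcode` as the join key is an assumption.
listings <- data.frame(
  id      = 1:6,
  zipcode = c("20002", "20002", "20004", "20004", "20009", "20010"),
  price   = c(115, 140, 192, 210, 119, 99)
)
crime <- data.frame(
  zipcode = c("20002", "20004", "20004", "20004", "20009", "20010", "20010")
)

# Count incidents per zip code.
crime_counts <- as.data.frame(table(zipcode = crime$zipcode),
                              responseName = "total_crimes",
                              stringsAsFactors = FALSE)

# Left join so that every listing keeps its row.
lm1_input <- merge(listings, crime_counts, by = "zipcode", all.x = TRUE)

# Simple regression of price on the neighborhood crime count.
fit <- lm(price ~ total_crimes, data = lm1_input)
print(coef(fit))
```

Note that in a setup like this every listing in the same area receives the same crime count, so the regressor varies only between areas, not between listings.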

While the estimated effect of total crime on price is not significant and therefore cannot shed light on either hypothesis, the sign of the coefficient is at least consistent with the conventional theory. The R² of zero, however, shows that there is still much work to be done to improve the model. Having looked at total crime, we next examine which types of crime affect prices the most. Below is the regression output of price vs. type of crime:


Calls:
Model 1: lm(formula = price ~ total_crimes, data = lm1_input)
Model 2: lm(formula = price ~ theftOther_rate + theftAuto_rate + robbery_rate + 
    motorTheft_rate, data = lm1_input)
Model 3: lm(formula = price ~ theftOther_rate + theftAuto_rate + robbery_rate + 
    motorTheft_rate + number_of_reviews + as.factor(room_type), 
    data = lm1_input)

=======================================================================
                                  Model 1      Model 2      Model 3    
-----------------------------------------------------------------------
  Constant                       237.696***   253.305***   321.940***  
                                 (11.740)     (13.274)     (13.703)    
  Crimes                          -0.024                               
                                  (0.020)                              
  Theft (other)                                32.073***    18.831**   
                                               (6.375)      (6.281)    
  Theft (from auto)                             3.059       -0.642     
                                               (7.815)      (7.646)    
  Robbery                                    -223.136***   -74.026     
                                              (66.169)     (65.246)    
  Motor Theft                                -215.027*    -255.909*    
                                             (103.999)    (101.468)    
  Number of Reviews                                         -0.750***  
                                                            (0.081)    
  Private Room/Entire Home/Apt                            -148.542***  
                                                           (13.004)    
  Shared Room/Entire Home/Apt                             -214.810***  
                                                           (32.436)    
-----------------------------------------------------------------------
  R-squared                        0.000        0.010        0.059     
  F                                1.432       11.910       42.195     
  p                                0.232        0.000        0.000     
  N                             4690         4689         4689         
=======================================================================
  Significance: *** = p < 0.001; ** = p < 0.01; * = p < 0.05  

Looking at Model 2, which breaks down each of the crime categories and incorporates them into the regression specification, the estimates yield a fascinating result. Examining only the statistically significant coefficients shows that robbery and motor theft negatively affect listing prices, while home theft (other theft is equivalent to home theft) increases listing prices. More specifically, a one percent increase in the robbery and motor theft rates leads to an estimated decrease in listing price of $223.14 and $215.03, respectively. On the other hand, a similar one percent increase in home theft leads to an increase of $32.07 in listing prices.

The coefficients seem to suggest that both hypotheses may have some merit in this discussion. It is intuitive that neighborhoods with higher robbery and motor theft rates will have lower listing prices. This result is consistent with the idea that fewer people will want to stay in neighborhoods where personal and property safety is at risk. However, it also makes sense that home theft may be associated with higher listing prices. After all, wealthier neighborhoods with more luxury goods at home may very well be bigger targets for home invasion. The last category, theft from automobiles, is not statistically significant, which is reasonable as theft from cars is likely not associated with listing prices.

To combine previous models with crime, we have also included the number of reviews and type of room in the regression equation. Unsurprisingly, private rooms and shared rooms have significantly lower listing prices than entire homes/apartments. Somewhat more surprising is the fact that increasing the number of reviews is associated with a decrease in prices. This result may be due to the fact that bad experiences (i.e., ones where guests would be compelled to write a review) may far outnumber good experiences and, as such, prices are lower for poorly reviewed listings. More interesting is the fact that adding these two variables weakens the crime effects: robbery loses its significance entirely, while home theft and motor theft remain significant, with the effect of home theft nearly halved and the effect of motor theft growing in magnitude by about $40. This result suggests that listing type may be a larger driver of listing price than crime rates.

4.3 Hedonic Regression Model

For our team’s third model, we wanted to use a regression technique commonly used in the field of economics: hedonic regression. Hedonic regression is a revealed-preference method used in economics and consumer science to determine the relative importance of the variables which affect the price of a good or service. To start, we used basic data visualization techniques to draw out any clear associations between the variables. Below is a pairwise scatterplot of the relevant regressors:

A cursory glance at the scatterplots shows no clear relationship between any of the variables. Moving forward, we implemented the hedonic regression specification and ran a couple of model diagnostics to determine whether the model is a good fit. The regression output and diagnostic charts are below:


Calls:
Model 1: lm(formula = price ~ room_type + number_of_reviews + availability_365 + 
    minimum_nights + reviews_per_month + calculated_host_listings_count, 
    data = data)

=============================================
  Constant                       188.146***  
                                  (3.386)    
  Private Room/Entire Home/Apt   -98.713***  
                                  (4.191)    
  Shared Room/Entire Home/Apt   -194.365***  
                                 (12.421)    
  Number of Reviews               -0.059     
                                  (0.037)    
  Availability                     0.080***  
                                  (0.015)    
  Min. Nights                     -0.717***  
                                  (0.107)    
  Reviews Per Month              -11.999***  
                                  (1.131)    
  Host Listings Count              3.089***  
                                  (0.138)    
---------------------------------------------
  R-squared                        0.180     
  F                              229.203     
  p                                0.000     
  N                             7298         
=============================================
  Significance: *** = p < 0.001;   
                ** = p < 0.01; * = p < 0.05  

The regression estimates show that all regressors except number of reviews are statistically significant in predicting listing price. Room type, minimum nights, and reviews per month negatively affect price, while availability and host listings count positively affect price. These results largely echo the results of the linear regression from section 4.1. However, the R² value has increased substantially to 0.18, indicating that 18% of the variation in price is now explained by this model.

                                   GVIF Df GVIF^(1/(2*Df))
room_type                      1.036775  2        1.009070
number_of_reviews              1.739067  1        1.318737
availability_365               1.092794  1        1.045368
minimum_nights                 1.023774  1        1.011817
reviews_per_month              1.739570  1        1.318927
calculated_host_listings_count 1.085557  1        1.041901

Looking at the plots, outliers do appear to be affecting the regression. As such, for the next iteration of the model specification, we have chosen to eliminate the outliers present. Additionally, the VIF test shows values that are all well below the common rule-of-thumb threshold of 10 (the largest is about 1.74), indicating that there is not much multicollinearity between the regressors. Good news! We ran the same model with the updated dataset:
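For a regressor with a single degree of freedom, the variance inflation factor is 1 / (1 - R²), where R² comes from regressing that variable on all the other regressors. The sketch below computes it by hand in base R on simulated data, in which `number_of_reviews` is deliberately built to correlate with `reviews_per_month`; it is an illustration of the formula, not the diagnostic code used for the output above.

```r
set.seed(7)
n <- 300

reviews_per_month <- rexp(n, rate = 1)
# Built to be correlated with reviews_per_month (simulation assumption).
number_of_reviews <- round(12 * reviews_per_month + rnorm(n, sd = 8))
availability_365  <- sample(0:365, n, replace = TRUE)
X <- data.frame(reviews_per_month, number_of_reviews, availability_365)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing x_j on the other regressors.
vif_manual <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))
```

In this simulation the two review variables inflate each other's variance while availability stays near 1, the same qualitative pattern as in the report's table, where the two review measures have the largest values.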


Calls:
Model 1: lm(formula = price ~ room_type + number_of_reviews + availability_365 + 
    minimum_nights + reviews_per_month + calculated_host_listings_count, 
    data = data)
Model 2: lm(formula = price ~ room_type + number_of_reviews + availability_365 + 
    minimum_nights + reviews_per_month + calculated_host_listings_count, 
    data = data4)

==========================================================
                                  Model 1      Model 2    
----------------------------------------------------------
  Constant                       188.146***   184.579***  
                                  (3.386)      (3.008)    
  Private Room/Entire Home/Apt   -98.713***   -95.383***  
                                  (4.191)      (3.710)    
  Shared Room/Entire Home/Apt   -194.365***  -191.502***  
                                 (12.421)     (10.996)    
  Number of Reviews               -0.059       -0.052     
                                  (0.037)      (0.032)    
  Availability                     0.080***     0.072***  
                                  (0.015)      (0.013)    
  Min. Nights                     -0.717***    -0.859***  
                                  (0.107)      (0.108)    
  Reviews Per Month              -11.999***   -11.340***  
                                  (1.131)      (1.002)    
  Host Listings Count              3.089***     3.172***  
                                  (0.138)      (0.122)    
----------------------------------------------------------
  R-squared                        0.180        0.214     
  F                              229.203      283.785     
  p                                0.000        0.000     
  N                             7298         7290         
==========================================================
  Significance: *** = p < 0.001; ** = p < 0.01;   
                * = p < 0.05  

After eliminating the outliers for Model 2, the regression coefficients are largely unchanged, but the R² value increases by about 3 percentage points, from 0.180 to 0.214. In addition, the diagnostic plots and tests show the same results as before:

                                   GVIF Df GVIF^(1/(2*Df))
room_type                      1.037025  2        1.009130
number_of_reviews              1.738428  1        1.318495
availability_365               1.094066  1        1.045976
minimum_nights                 1.028644  1        1.014221
reviews_per_month              1.741563  1        1.319683
calculated_host_listings_count 1.086272  1        1.042244

To see how these variables affect listing price, we chose an example and ran it through the regression equation. Since room type seems to be the primary driver, only that variable is changed. The others are held fixed at:

  • Number of reviews = 10
  • Availability = 365
  • Minimum nights = 1
  • Reviews per month = 0.5
  • Host listings count = 1

With a room type of “shared room”, the model predicts a listing price of $15.56. With a private room, the price increases to $111.68; and finally, with a room type of “entire home/apt,” the predicted listing price is $207.06 (the private-room prediction plus the $95.38 private-room coefficient).
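The worked example can be reproduced by plugging the example values into the Model 2 equation. The sketch below uses the coefficients as printed in the table, which are rounded to three decimals, so the results can differ slightly from the figures quoted in the text; `predict_price` is a hypothetical helper written for this illustration.

```r
# Model 2 coefficients as printed in the table (rounded to three decimals).
b <- c(intercept = 184.579, private_room = -95.383, shared_room = -191.502,
       number_of_reviews = -0.052, availability_365 = 0.072,
       minimum_nights = -0.859, reviews_per_month = -11.340,
       host_listings_count = 3.172)

# Hypothetical helper: predicted price for the example listing, by room type.
predict_price <- function(room_type) {
  base <- b["intercept"] +
    b["number_of_reviews"]   * 10 +
    b["availability_365"]    * 365 +
    b["minimum_nights"]      * 1 +
    b["reviews_per_month"]   * 0.5 +
    b["host_listings_count"] * 1
  # Entire home/apt is the reference category, so its shift is zero.
  shift <- switch(room_type,
                  "Entire home/apt" = 0,
                  "Private room"    = b["private_room"],
                  "Shared room"     = b["shared_room"])
  unname(base + shift)
}

preds <- sapply(c("Shared room", "Private room", "Entire home/apt"), predict_price)
print(round(preds, 2))  # 15.48, 111.60, 206.98 with the rounded coefficients
```

Because room type enters as a set of dummy variables, switching room type shifts the prediction by exactly the corresponding coefficient while everything else stays fixed.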

4.4 Machine Learning

Apart from building regression models, our team also wanted to implement machine learning techniques for dimension reduction and variable selection in predicting price quantile. This section begins by constructing a decision tree that will split the data algorithmically. Following the tree, we implement the KNN procedure in determining price quantile.
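Turning a continuous price into a categorical target can be done with `quantile()` and `cut()`. The sketch below assumes four equal-sized classes (quartiles), which is consistent with the four classes in the output further down but is not confirmed by the source; the prices are simulated.

```r
set.seed(3)

# Simulated listing prices (not the report's data).
price <- rlnorm(1000, meanlog = 4.8, sdlog = 0.6)

# Break points at the 0th, 25th, 50th, 75th, and 100th percentiles.
breaks <- quantile(price, probs = seq(0, 1, 0.25))

# Label each listing 1-4 by price quartile; include.lowest keeps the minimum.
quartile <- cut(price, breaks = breaks, include.lowest = TRUE, labels = 1:4)
print(table(quartile))
```

With equal-sized classes, always guessing a single class is right about 25% of the time, which is the baseline any classifier in this section has to beat.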

The results of the tree classification are shown below along with a visual representation of the decision tree. Again, we predict price quantile using various regressors including: total crime count, minimum nights, number of reviews, availability, and host listing count.

Looking at the decision tree, the nodes are split using number of reviews, host listing count, minimum nights, and total crime count. Overall, the nodes do not appear to be predicting the quantiles very well. Further analysis using a confusion matrix is needed to see the overall accuracy of the tree.

[1] "Overall: "
      Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
  3.982343e-01   1.976599e-01   3.860482e-01   4.105158e-01   2.500803e-01 
AccuracyPValue  McnemarPValue 
 2.321758e-145   6.898192e-50 
[1] "Class: "
         Sensitivity Specificity Pos Pred Value Neg Pred Value Precision
Class: 1   0.3029525   0.8880565      0.4743719      0.7925501 0.4743719
Class: 2   0.4213231   0.7554034      0.3646470      0.7966599 0.3646470
Class: 3   0.3177150   0.7615582      0.3076445      0.7699632 0.3076445
Class: 4   0.5510597   0.7926386      0.4696223      0.8412446 0.4696223
            Recall        F1 Prevalence Detection Rate
Class: 1 0.3029525 0.3697611  0.2500803     0.07576244
Class: 2 0.4213231 0.3909416  0.2499197     0.10529695
Class: 3 0.3177150 0.3125987  0.2500803     0.07945425
Class: 4 0.5510597 0.5070922  0.2499197     0.13772071
         Detection Prevalence Balanced Accuracy
Class: 1            0.1597111         0.5955045
Class: 2            0.2887640         0.5883632
Class: 3            0.2582665         0.5396366
Class: 4            0.2932584         0.6718491
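The overall statistics in this output are printed in scientific notation, so they are easy to misread. As a cross-check, the sketch below recomputes accuracy and Kappa from the per-class columns of the same output. The numbers are copied from the table; the variable names are our own, and the snippet is written in Python rather than the R used for the analysis.

```python
# Per-class values copied from the confusion-matrix output above (classes 1-4).
prevalence = [0.2500803, 0.2499197, 0.2500803, 0.2499197]
detection_rate = [0.07576244, 0.10529695, 0.07945425, 0.13772071]
detection_prevalence = [0.1597111, 0.2887640, 0.2582665, 0.2932584]

# Detection rate is the share of all observations that are correctly
# assigned to a class, so the rates sum to the overall accuracy.
accuracy = sum(detection_rate)

# Chance agreement: probability that truth and prediction coincide
# if they were independent with the observed marginals.
expected = sum(p * d for p, d in zip(prevalence, detection_prevalence))
kappa = (accuracy - expected) / (1 - expected)

# Accuracy from always guessing the most common class.
no_information_rate = max(prevalence)

print(f"accuracy = {accuracy:.4f}")
print(f"kappa = {kappa:.4f}")
print(f"no-information rate = {no_information_rate:.4f}")
```

Both recomputed values agree with the reported Accuracy (3.982343e-01, about 39.8%) and Kappa (1.976599e-01, about 0.198), against a no-information rate of about 25%.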

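As expected, the overall model accuracy is low: 39.8% (95% confidence interval 38.6% to 41.1%). This is better than the 25% no-information rate, but the tree still misclassifies roughly three out of five listings, and the Kappa of 0.198 indicates only weak agreement beyond chance. The decision tree therefore offers limited predictive ability. Below are the cross-validation results: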


Classification tree:
rpart(formula = quantile ~ Total + `Min. Nights` + `Number of Reviews` + 
    Availability + `Host Listing Count`, data = listingsApartment)

Variables actually used in tree construction:
[1] Host Listing Count Min. Nights        Number of Reviews 
[4] Total             

Root node error: 4672/6230 = 0.74992

n= 6230 

        CP nsplit rel error  xerror      xstd
1 0.115368      0   1.00000 1.02312 0.0071392
2 0.036815      1   0.88463 0.89255 0.0079479
3 0.025471      2   0.84782 0.85338 0.0081094
4 0.019906      3   0.82235 0.83690 0.0081674
5 0.010000      4   0.80244 0.80651 0.0082595
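In the table above, rel error and xerror are scaled so that the root node has an error of 1. Multiplying by the root node error converts them into ordinary misclassification rates. The sketch below does this for the final row of the table; the variable names are our own, and the snippet is in Python rather than the R used for the analysis.

```python
# Root node error from the output above: 4672 of 6230 listings are
# misclassified when every listing is assigned to a single class.
root_node_error = 4672 / 6230

# Final row of the CP table (4 splits).
rel_error, xerror, xstd = 0.80244, 0.80651, 0.0082595

# Rescale from "relative to the root" to absolute misclassification rates.
train_error = root_node_error * rel_error
cv_error = root_node_error * xerror
cv_error_se = root_node_error * xstd

print(f"training accuracy = {1 - train_error:.4f}")
print(f"cross-validated accuracy = {1 - cv_error:.4f} (SE {cv_error_se:.4f})")
```

The training accuracy recovered this way (about 39.8%) matches the confusion-matrix accuracy, and the cross-validated accuracy is only slightly lower (about 39.5%), so the weak performance is not a result of overfitting.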

Again, these summaries and plots reaffirm that this model does not hold much predictive power. Moving forward, we implement the KNN methodology to see if it yields better results. The first output uses k = 7.


 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  2111 

 
             | listings_pred 
  testLabels |         1 |         2 |         3 |         4 | Row Total | 
-------------|-----------|-----------|-----------|-----------|-----------|
           1 |       256 |       130 |        86 |        65 |       537 | 
             |     0.477 |     0.242 |     0.160 |     0.121 |     0.254 | 
             |     0.371 |     0.259 |     0.207 |     0.129 |           | 
             |     0.121 |     0.062 |     0.041 |     0.031 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
           2 |       197 |       169 |       106 |        73 |       545 | 
             |     0.361 |     0.310 |     0.194 |     0.134 |     0.258 | 
             |     0.286 |     0.337 |     0.255 |     0.145 |           | 
             |     0.093 |     0.080 |     0.050 |     0.035 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
           3 |       140 |       124 |       139 |        95 |       498 | 
             |     0.281 |     0.249 |     0.279 |     0.191 |     0.236 | 
             |     0.203 |     0.248 |     0.334 |     0.188 |           | 
             |     0.066 |     0.059 |     0.066 |     0.045 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
           4 |        97 |        78 |        85 |       271 |       531 | 
             |     0.183 |     0.147 |     0.160 |     0.510 |     0.252 | 
             |     0.141 |     0.156 |     0.204 |     0.538 |           | 
             |     0.046 |     0.037 |     0.040 |     0.128 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
Column Total |       690 |       501 |       416 |       504 |      2111 | 
             |     0.327 |     0.237 |     0.197 |     0.239 |           | 
-------------|-----------|-----------|-----------|-----------|-----------|
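The overall accuracy of this cross table is the sum of the diagonal cells divided by the total number of test observations. The sketch below carries out that calculation using the counts from the table; the variable names are our own, and the snippet is in Python rather than the R used for the analysis.

```python
# Counts copied from the cross table above.
# Rows are the true labels (testLabels); columns are the KNN predictions.
confusion = [
    [256, 130, 86, 65],
    [197, 169, 106, 73],
    [140, 124, 139, 95],
    [97, 78, 85, 271],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
accuracy = correct / total

# Share of each true class that was predicted correctly (N / Row Total).
recall = [confusion[i][i] / sum(confusion[i]) for i in range(len(confusion))]

print(f"{correct} of {total} correct, accuracy = {accuracy:.2%}")
print([round(r, 3) for r in recall])
```

The per-class shares show that the top and bottom price classes are recovered about half the time, while the two middle classes are recovered less than a third of the time.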

 

Using k = 7, the overall accuracy of the KNN methodology is 39.55%. Next, we try to find the optimal k-value that will give us the highest accuracy rate. For that, we implement Dr. Lo’s “chooseK” function.

Looking at the graph above, k-values of 11 and 21 both yield the highest accuracy rate of 40.5%. Of these two, we take k = 11 as the best model using the KNN methodology.
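The idea behind scanning over k can be sketched as follows. This is not Dr. Lo’s “chooseK” function, whose implementation is not shown here; it is a minimal Python illustration under stated assumptions. The data are synthetic (two noisy clusters), not the Airbnb listings, and the helper names knn_predict and accuracy_for_k are hypothetical.

```python
import math
import random
from collections import Counter


def knn_predict(train_x, train_y, point, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], point))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def accuracy_for_k(train_x, train_y, test_x, test_y, k):
    """Share of test points whose predicted label matches the true label."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y for x, y in zip(test_x, test_y))
    return hits / len(test_y)


# Synthetic, illustrative data only: two noisy clusters labelled 0 and 3.
rng = random.Random(0)
points = [(rng.gauss(c, 1.0), rng.gauss(c, 1.0)) for c in (0, 3) for _ in range(60)]
labels = [c for c in (0, 3) for _ in range(60)]

# Random train/test split.
idx = list(range(len(points)))
rng.shuffle(idx)
train, test = idx[:80], idx[80:]
train_x, train_y = [points[i] for i in train], [labels[i] for i in train]
test_x, test_y = [points[i] for i in test], [labels[i] for i in test]

# Evaluate odd values of k and keep the smallest k that attains the best accuracy.
scores = {k: accuracy_for_k(train_x, train_y, test_x, test_y, k) for k in range(1, 22, 2)}
best_score = max(scores.values())
best_k = min(k for k, s in scores.items() if s == best_score)

print(scores)
print("best k:", best_k)
```

Breaking ties toward the smaller k is one simple convention; it is consistent with choosing k = 11 over k = 21 when both reach the same accuracy.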

5. Conclusion

Now that we have finished our rigorous quantitative analysis, we turn our attention to deciding which method best suits the question at hand: determining the variables that affect Airbnb listing price.

Our first attempt at modeling, using both univariate and multivariate regression specifications, did not yield promising results. R² values were generally low because the data at hand did not exhibit a strong linear relationship.

As such, we moved to incorporating crime statistics to determine if those would be strong drivers of price. After adding crime rates of different types of offenses into the model specification, we found certain crimes (namely home theft and motor theft) did have a statistically significant effect on listing price. However, with the addition of room type and number of reviews, those previously significant estimates on crime faded to insignificance, indicating that internal Airbnb factors (room type and reviews) may still be the primary drivers of price.

We followed up on the crime rate analysis using a hedonic regression model. That model produced statistically significant results and indicated that the primary driver of price is, in fact, room type. As evidenced by our demonstration, the predicted price of an entire home/apartment ($206.06) is more than 13 times that of a shared room ($15.56).

Lastly, we took a step back and implemented machine learning techniques for variable selection in order to find the main determinants of price. Unfortunately, both the decision tree and the KNN model yielded unconvincing results with low accuracy rates.

In summary, we find that room type is the primary driver of listing price and the hedonic model is best suited for the data.